WiDS Checker: Combating Bugs in Distributed Systems
Abstract
Despite many efforts, the predominant practice of debugging a distributed system is still printf-based log mining, which is both tedious and error-prone. In this paper, we present WiDS Checker, a unified framework that can check distributed systems through both simulation and reproduced runs from real deployment. All instances of a distributed system can be executed within one simulation process, multiplexed properly to observe the “happens-before” relationship, thus accurately revealing the full system state. A versatile script language allows a developer to refine system properties into straightforward assertions, which the checker inspects for violations. Combining these two components, we are able to check distributed properties that are otherwise impossible to check. We applied WiDS Checker to a suite of complex, real systems and found non-trivial bugs, including one in a previously proven Paxos specification. Our experience demonstrates the usefulness of the checker and allows us to gain insights beneficial to future research in this area.
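The general idea of running every node inside one process and checking an assertion over the full system state after each event can be illustrated with a minimal sketch. This is not the WiDS Checker API or its script language: the `Node`, `Simulator`, and `at_most_one_holder` names are hypothetical, and simple integer timestamps stand in for the happens-before ordering that the real tool preserves.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node state: whether this node believes it holds a lock."""
    name: str
    holds_lock: bool = False


class Simulator:
    """Runs all nodes in one process and delivers messages in timestamp order.

    Assumption: integer timestamps approximate a happens-before order.
    """

    def __init__(self, nodes, invariant):
        self.nodes = {n.name: n for n in nodes}
        self.invariant = invariant      # predicate over the full system state
        self.queue = []                 # heap of (time, seq, dest, msg)
        self.seq = 0                    # tie-breaker for equal timestamps
        self.violations = []

    def send(self, time, dest, msg):
        heapq.heappush(self.queue, (time, self.seq, dest, msg))
        self.seq += 1

    def run(self):
        while self.queue:
            time, _, dest, msg = heapq.heappop(self.queue)
            node = self.nodes[dest]
            if msg == "grant":
                node.holds_lock = True
            elif msg == "release":
                node.holds_lock = False
            # Check the global assertion after every delivered event.
            if not self.invariant(self.nodes):
                self.violations.append((time, dest, msg))
        return self.violations


def at_most_one_holder(nodes):
    """Distributed property: at most one node holds the lock at any time."""
    return sum(n.holds_lock for n in nodes.values()) <= 1


def demo():
    sim = Simulator([Node("a"), Node("b")], at_most_one_holder)
    sim.send(1, "a", "grant")
    sim.send(2, "b", "grant")    # injected bug: granted before "a" releases
    sim.send(3, "a", "release")
    return sim.run()
```

Because the simulator sees every node's state at once, the mutual-exclusion property becomes a one-line predicate. In a real deployment the same check would require correlating logs from separate machines.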
Similar resources
WiDS: An Integrated Toolkit for Distributed System Development
Faced with a proliferation of distributed systems in research and production groups, we have devised the WiDS ecosystem of technologies to optimize the development and testing process for such systems. WiDS optimizes the process of developing an algorithm, testing its correctness in a debuggable environment, and testing its behavior at large scales in a distributed simulation. We have developed...
Reliable Design of Concurrent Software
Conventional static and dynamic testing techniques quickly break down if they are applied to distributed systems software. The bugs in such systems are usually triggered by irreproducible event sequences that can make debugging a nightmare. There are still tools that can help a programmer build a reliable system. One of the most popular such tools is the SPIN model checker. This article explain...
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of un...
MODIST: Transparent Model Checking of Unmodified Distributed Systems
MODIST is the first model checker designed for transparently checking unmodified distributed systems running on unmodified operating systems. It achieves this transparency via a novel architecture: a thin interposition layer exposes all actions in a distributed system and a centralized, OS-independent model checking engine explores these actions systematically. We made MODIST practical through ...
PROGRAMMING Simple Testing Can Prevent Most Critical
Xu Zhao is a graduate student at the University of Toronto. He received a B.Eng. in computer science from Tsinghua University, China. At U of T, his research is focused on reliability and performance of distributed systems. Large, production-quality distributed systems still fail periodically, sometimes catastrophically where most or all users experience an outage or d...
Publication date: 2007